Programming Massively Parallel Processors: A Hands-on Approach: SIMT Execution Model and Warp Partitioning

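The SIMT (Single-Instruction, Multiple-Thread) model is the core of GPU architecture. Developers program individual threads, but the hardware coordinates them through a two-level hierarchy of grids and blocks. To maximize efficiency, the hardware further partitions each block into units of 32 threads, called warps.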

1. SIMT vs. SIMD

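Unlike CPU SIMD (SSE/AVX, etc.), which requires manually packing data into vector registers, SIMT lets each thread appear independent. The hardware automatically groups threads into warps and fetches one instruction at a time, which all 32 threads of the warp execute in lockstep.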
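To make the contrast concrete, here is a minimal host-side C++ sketch (not device code). The per-thread body is written as scalar code for a single thread, and a plain loop stands in for the hardware that applies it to the 32 lanes of one warp. The names `add_one_element` and `run_warp` are hypothetical and chosen only for illustration. The loop models the grouping, not the lockstep parallelism.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;  // warp width stated in the text

// Hypothetical per-thread body: scalar code for ONE thread, the way a
// SIMT programmer writes a kernel. No vector registers are packed by hand.
inline void add_one_element(std::size_t tid,
                            const float* a, const float* b, float* c) {
    c[tid] = a[tid] + b[tid];
}

// Host-side stand-in for the hardware: applies the same per-thread body
// to all 32 lanes of one warp. On a GPU the lanes run in lockstep; this
// sequential loop only models which threads are grouped together.
inline void run_warp(std::size_t warp_id,
                     const float* a, const float* b, float* c) {
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        add_one_element(warp_id * kWarpSize + lane, a, b, c);
    }
}
```

Calling `run_warp(1, a, b, c)` on 64-element arrays touches only elements 32 through 63, which is the slice that belongs to warp 1.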

2. Linearization Rule

Developers use threadIdx.x, threadIdx.y, and threadIdx.z for logical indexing, but the hardware flattens them into a one-dimensional sequence for scheduling:

Index = x + (y × blockDim.x) + (z × blockDim.x × blockDim.y)

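Because x is the fastest-varying dimension, threads with consecutive threadIdx.x values generally land in the same warp. When those threads also access consecutive memory addresses, the hardware can combine the accesses into fewer transactions, which is known as memory coalescing.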
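The linearization formula can be checked with a few lines of host-side C++. This is a sketch under stated assumptions: `Dim3` is a hypothetical stand-in for CUDA's `dim3`/`threadIdx`, and `warp_of` and `lane_of` are illustrative helper names, not CUDA API calls. The warp size of 32 is taken from the text.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3 / threadIdx (illustration only).
struct Dim3 { unsigned x, y, z; };

constexpr unsigned kWarpSize = 32;  // warp width stated in the text

// Index = x + (y * blockDim.x) + (z * blockDim.x * blockDim.y)
constexpr unsigned linear_thread_id(Dim3 t, Dim3 block) {
    return t.x + t.y * block.x + t.z * block.x * block.y;
}

// Warp number and position inside the warp for a given linear ID.
constexpr unsigned warp_of(unsigned linear_id) { return linear_id / kWarpSize; }
constexpr unsigned lane_of(unsigned linear_id) { return linear_id % kWarpSize; }

// Thread (3, 2, 1) in an 8 x 4 x 2 block: 3 + 2*8 + 1*8*4 = 51.
static_assert(linear_thread_id({3, 2, 1}, {8, 4, 2}) == 51, "linear ID");
static_assert(warp_of(51) == 1 && lane_of(51) == 19, "warp 1, lane 19");

// Stepping along x gives the next linear ID, so x-neighbours sit in
// adjacent lanes of the same warp (unless a warp boundary falls between them).
static_assert(linear_thread_id({4, 2, 1}, {8, 4, 2}) == 52, "x-neighbour");
```

The `static_assert` lines are evaluated at compile time, so the file compiling cleanly is itself the check.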

QUESTION 1

What is the physical scheduling 'atom' (minimum unit) in NVIDIA's SIMT model?

A single Thread

A Warp (32 threads)

A Streaming Multiprocessor

A Grid

QUESTION 2

In an 8x8 thread block, which threads will be assigned to Warp 0?

Threads with linear IDs 0 through 31

Threads with y=0 and y=1

Only threads where x=0

All 64 threads belong to Warp 0

QUESTION 3

How does SIMT differ from traditional SIMD (like SSE/AVX)?

SIMT requires manual packing of vector registers.

SIMT allows threads to operate independently at the software level while hardware manages vectorization.

SIMD is only for GPUs; SIMT is for CPUs.

There is no difference; they are synonymous.

QUESTION 4

Using the linearization formula, what is the ID of T(2, 1, 0) in a block with blockDim.x=16 and blockDim.y=16?

QUESTION 5

What happens if threads within a warp take different execution paths (e.g., an if-else statement)?

The warp executes all paths in parallel without penalty.

The warp splits into two warps.

The hardware serializes the paths, disabling threads not on the current path.

The kernel crashes.

Architectural Analysis: Linearization and Bandwidth

Applying Warp Partitioning to Matrix Addition

You are optimizing a matrix addition kernel on a 2D grid. The threads are organized into 8x8 blocks. You are considering using shared memory to improve performance.

1. [Reading context: Figure 6.1 shows an example of placing threads of a two-dimensional (2D) block into linear order.] Sketch the partitioning of an 8x8 thread block into warps. Which thread index (x, y) marks the end of Warp 0?

Solution:
In an 8x8 block (64 threads), threads are linearized as $Index = tx + (ty \times 8)$. Warp 0 covers linear IDs 0 to 31. $T(0,0)$ has ID 0, and ID 31 corresponds to $tx=7, ty=3$ ($7 + 3 \times 8 = 31$). Therefore, Warp 0 starts at $T(0,0)$ and ends at $T(7,3)$, which is the first four rows ($y = 0$ to $3$). Warp 1 begins at $T(0,4)$ (ID 32) and ends at $T(7,7)$ (ID 63).
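The warp boundaries in this solution can be reproduced by inverting the 2D linearization. The following host-side C++ sketch assumes a 2D block whose thread count is a multiple of 32. `ThreadXY`, `thread_from_linear`, `first_thread_of_warp`, and `last_thread_of_warp` are hypothetical names used only for this illustration.

```cpp
#include <cassert>

struct ThreadXY { unsigned x, y; };  // hypothetical (tx, ty) pair

constexpr unsigned kWarpSize = 32;   // warp width stated in the text

// Inverse of index = tx + ty * block_dim_x for a 2D block.
constexpr ThreadXY thread_from_linear(unsigned id, unsigned block_dim_x) {
    return ThreadXY{id % block_dim_x, id / block_dim_x};
}

// First and last thread of a warp. Assumes the block size is a multiple
// of 32, so every warp is full.
constexpr ThreadXY first_thread_of_warp(unsigned warp, unsigned block_dim_x) {
    return thread_from_linear(warp * kWarpSize, block_dim_x);
}
constexpr ThreadXY last_thread_of_warp(unsigned warp, unsigned block_dim_x) {
    return thread_from_linear(warp * kWarpSize + kWarpSize - 1, block_dim_x);
}

// 8x8 block: Warp 0 is T(0,0)..T(7,3), Warp 1 is T(0,4)..T(7,7).
static_assert(last_thread_of_warp(0, 8).x == 7 &&
              last_thread_of_warp(0, 8).y == 3, "end of Warp 0");
static_assert(first_thread_of_warp(1, 8).x == 0 &&
              first_thread_of_warp(1, 8).y == 4, "start of Warp 1");
static_assert(last_thread_of_warp(1, 8).x == 7 &&
              last_thread_of_warp(1, 8).y == 7, "end of Warp 1");
```

Changing `block_dim_x` shows how the split moves: with a 16-wide block, for instance, each warp covers two rows instead of four.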

2. Consider the matrix addition where each element of the output matrix is the sum of the corresponding elements of the two input matrices. Can one use shared memory to reduce the global memory bandwidth consumption? Explain with details (approx. 150 words).

Solution:
No, shared memory cannot reduce global memory bandwidth for matrix addition. In matrix addition, each output element $C[i][j]$ is the sum of $A[i][j] + B[i][j]$. Each element from the input matrices $A$ and $B$ is accessed exactly once by exactly one thread to compute its unique output. Shared memory is a high-speed scratchpad designed for data reuse—where multiple threads within a block read the same global memory address repeatedly (as seen in matrix multiplication). Since there is zero data commonality between threads in matrix addition, loading data into shared memory first would actually increase overhead. It would require one global load to shared memory, a `__syncthreads()` barrier, and then a shared memory load before the addition. This adds instruction count and synchronization latency without decreasing the number of global memory transactions. To optimize matrix addition, one should focus on global memory coalescing rather than shared memory utilization.
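The argument above can be restated as a simple operation count. The host-side C++ sketch below is a counting model only: it ignores caches and is not a performance measurement. The `Traffic` struct and all function names are hypothetical. The matrix multiplication counts use the standard estimate of $2N^3$ global loads for the naive kernel and $2N^3/T$ for a tiled kernel with tile width $T$, assuming $T$ divides $N$.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tally of memory operations for one kernel launch.
struct Traffic {
    std::size_t global_loads;
    std::size_t shared_stores;
    std::size_t shared_loads;
    std::size_t barriers_per_block;
};

// Direct addition: each thread reads A[i][j] and B[i][j] from global memory.
constexpr Traffic direct_add(std::size_t n_elems) {
    return {2 * n_elems, 0, 0, 0};
}

// Staged addition: each thread copies its two operands into shared memory,
// waits at one barrier, then reads them back. Global loads are unchanged.
constexpr Traffic staged_add(std::size_t n_elems) {
    return {2 * n_elems, 2 * n_elems, 2 * n_elems, 1};
}

// Contrast: N x N matrix multiplication, where inputs ARE reused.
constexpr std::size_t naive_matmul_global_loads(std::size_t n) {
    return 2 * n * n * n;
}
constexpr std::size_t tiled_matmul_global_loads(std::size_t n, std::size_t t) {
    return 2 * n * n * n / t;  // assumes t divides n
}

// 64 x 64 addition: staging saves no global loads and adds extra work.
static_assert(direct_add(4096).global_loads == staged_add(4096).global_loads,
              "same global traffic");
static_assert(staged_add(4096).shared_loads == 8192, "extra shared loads");

// 64 x 64 multiplication with 16-wide tiles: 16x fewer global loads.
static_assert(naive_matmul_global_loads(64) ==
              16 * tiled_matmul_global_loads(64, 16), "reuse factor 16");
```

In this model the staged addition keeps the same global load count and only adds shared memory operations and a barrier, while tiling matrix multiplication divides global loads by the tile width.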